Version history

| Version | Date | Changes |
|---|---|---|
| | | Initial layout (Created) |
| | | Preface (Created); Introduction (Created); Domain Understanding (Created) |
| | | Phase 1 (Created) |
| | | Phase 2 (Created) |
| | | Phase 2 (Extended); Phase 3 (Created) |
| 0.6 | 14/03/2022 | Iteration 0 (Submitted) |
| | | Iteration 0 (Approved); Feedback section (Added); Feedback after Iteration 0 (Added); Spelling mistakes fixed (whole document); Clients' information (Added to Phase 1); Clients' benefits (Added to Phase 1); Interview planning (Added to Phase 1); Societal and people impact (Added to Phase 1); Domain Understanding (Further research) |
| | | Excessive theory removed (correlation and STD formulas); Figure numbers (Added); Conclusion for Iteration 0 (Extended); 6.3 Evaluation (Extended) |
| | | Required data elements, Phase 2 (Extended); EDA (Extended) |
| | | Preprocessing (Extended with outliers and no-outliers sets) |
| | | Modelling (Added regression, changed kNN) |
| | | Evaluation (Added regression, updated kNN) |
| | | Domain Understanding (Extended points 4.1.1-4.1.6); Conclusion for Iteration 1 (Added) |
| | | Table of contents (Updated); Addressing feedback from Iteration 0 |
| 1.7 | 01/04/2022 | Iteration 1 (Submitted) |
| | | Iteration 1 (Approved); Feedback after Iteration 1 (Added); "Why" explanations added to points 5.3.4, 5.3.5, 5.3.6, 5.3.7 and 5.4; Phase 3 explanations (Added, Modified) in 6.1.1, 6.1.2 and 6.3.2 |
| | | Interview with expert (Answers added); Iteration 2 section (Created); Phase 3 in Iteration 2 (Created) |
| | | Phase 4 in Iteration 2 (Created); Table of contents (Updated) |
| 2.2 | 15/04/2022 | Iteration 2 (Submitted) |
Table of contents, Iteration 0 & 1
Client
3.1 Who is my Client?
3.2 How will my Client benefit from this project?
Proposal (Phase 1)
4.1 Domain Understanding
4.1.1 Research methods
4.1.2 Facebook overview
4.1.3 Facebook users
4.1.4 Marketing on Facebook
4.1.5 Facebook pages and engagement
4.1.6 Facebook's future
4.1.7 Interview with domain expert
4.1.8 What impact has this project on society and people?
4.2 Data Sourcing
4.3 Analytic Approach
Provisioning (Phase 2)
5.1 Data Requirements
5.1.1 Domain
5.1.2 Stakeholders
5.1.3 Required Data Elements
5.1.4 Candidate Data Sources
5.2 Data Collection
5.3 Data Understanding
5.3.1 Importing libraries
5.3.2 Importing data
5.3.3 Explaining column names
5.3.4 Computing Summary Statistics
5.3.5 Visualizing Correlation
5.3.6 Standardization
5.3.7 Examining page categories
5.3.8 Examining post data
5.4 Data Preparation
Predictions (Phase 3)
6.1 Preprocessing
6.1.1 Data Standardization
6.1.2 Selecting features
6.1.3 Dividing data into train and test set
6.1.4 Removing outliers
6.1.5 Selecting features (with no outliers)
6.1.6 Dividing data into train and test set (with no outliers)
6.2 Modelling
6.2.1 Linear Regression
6.2.2 k-Nearest Neighbors
6.3 Evaluation
6.3.1 Linear Regression
6.3.2 k-Nearest Neighbors
Table of contents, Iteration 2
Predictions (Phase 3)
6.1 Preprocessing
6.1.1 Removing outliers
6.1.2 Scaling features
6.1.3 Selecting features
6.1.4 Dividing data into train and test set
6.2 Modelling
6.2.1 Visualization - Linear Regression Prediction Surface
6.2.2 Visualization - SVR Prediction Surface
6.3 Evaluation
6.3.1 Support Vector Machine
Delivery (Phase 4)
7.1 Model selection
7.2 Model deployment
7.3 Application field testing
7.4 Collecting & Documenting
7.5 Presentation & Reporting
My name is Andrzej Krasnodebski and I am a fourth-semester student at Fontys University of Applied Sciences in Eindhoven, the Netherlands.
I follow the ICT & Business profile and am currently enrolled in the Artificial Intelligence specialization.
This markdown document presents the personal challenge I need to complete to prove my learning outcomes. It will be updated as I progress with the course.
The goal is to carry out a complete Machine Learning project following the IBM Data Science Methodology, consisting of 4 main phases: Proposal (Phase 1), Provisioning (Phase 2), Predictions (Phase 3) and Delivery (Phase 4).
Disclaimer:
This file is a complete walkthrough of my work process with code, visuals, descriptions, analyses and occasionally personal thoughts or comments (please treat them accordingly). I will do my best to create the best model for the given scenario, but I promise nothing beyond experiments. The idea is to showcase everything that came up during the work process, and unless something is a total educationally misleading disaster, I am going to leave it in this report for reference.
To stay connected to the everyday world, people use many different social media platforms. Of all of these, Facebook is one of the biggest and oldest.
Facebook is operated by Meta, an American multinational technology conglomerate based in Menlo Park, California. The company is the parent organization of Facebook, Instagram, and WhatsApp, among other subsidiaries. Meta is one of the world's most valuable companies and one of the Big Five American information technology companies, alongside Alphabet, Amazon, Apple, and Microsoft.
Founded in 2004 by Mark Zuckerberg, Facebook currently unites over 2.9 billion users.
Project goal:
Predict the number of Facebook post shares based on the page popularity and the weekday of publishing.
My client is the owner and current administrator of an e-commerce-oriented Facebook page with almost 1 million likes. It represents the profile of a company offering its services on this market. Because of this sector's specifics, the page is constantly monitored and updated with various content about the company, ongoing projects, job opportunities and e-commerce nuances. As a firm, they excel in digital marketing, web design and supplier selection, which secures their position on the market.
For privacy reasons, the company asked not to mention their name, Facebook page and logo until the delivery of the final product.
One of their goals is to constantly improve the quality of services offered to new and current clients. To do that, they want to be able to predict the Facebook post share count, to see whether information, shared by them or by their client, is likely to be spread across users and is therefore interesting for them. The two factors they want to take into account in this prediction are the mother page's like volume and the weekday of publishing.
The company reached out to me on LinkedIn asking me to take up this challenge, as they believe it will be a superb learning opportunity for me and a convenient financial option for them.
3.2 How will my Client benefit from this project?
An algorithm predicting post share count will improve and extend the services offered by my client.
It will help to calculate the profitability of marketing campaigns aimed at increasing the number of likes of the Facebook page, influencing resource allocation by indicating whether an additional X likes will decisively improve the post share volume and extend the reach of the page, or whether those resources should be allocated somewhere else.
Moreover, the algorithm will predict the best day to publish a post, resulting in high engagement and a big reach in the target audience.
Additionally, it may be used to help indicate trends in post sharing, which will also enrich the current services offered by my client.
The first phase of the project starts with a focus on researching the chosen domain and better understanding the topic.
Afterwards, it moves into Data Sourcing and the search for data enrichment.
It finishes with defining a clear goal for the modelling and choosing the best approach to the project.
The domain I will be researching is Facebook posts and pages: how they work and what the behavioural factors are.
This domain is part of two bigger ones:
Social Media -> Facebook -> Posts and Pages
To structure my research, I have come up with research questions that I will try to answer with my findings.
Main Research Question:
RQ: How to predict the posts' shares volume based on mother page like count and weekday of posting?
Sub Questions:
SQ1. How does Facebook position posts on the main page?
SQ2. Does the weekday and time influence posts' reach?
SQ3. What days are the best to post in order to achieve the most post shares?
SQ4. What factors influence post shares volume the most?
SQ5. Can a Machine Learning algorithm predict the posts' shares volume based on selected features?
This section is intended to document my findings about the domain I am researching and is crucial for every AI project. I will try to collect as much useful information as possible and, based on that, draw meaningful conclusions later in the project. As Facebook is something I use on a weekly basis, I am confident in what I already know and treat my knowledge as a solid base for this exploratory research.
Social media is a very broad topic with new insights appearing every day. Because of this, researching one of the biggest platforms might be overwhelming. However, it also means that there is a lot of material easily available, and finding a suitable source will not be a problem. The goal for this phase is to broaden and organise my knowledge about this domain, and to make sure I do it properly I want to follow a research pattern, on which I elaborate below.
There are many ICT research methods available and in a perfect scenario I would use them all and gain experience in every method. However, not every research method is suitable for every problem, and a good researcher chooses the ones that suit it the most. I want to be a good researcher and at the same time keep this section concise and valuable.
Consulting the ICT research methods helped me to pick the best combination for my project.
Research methods I will be using:
I start the Domain Understanding with quick research into the factors that influence a post's popularity, i.e. the number of shares. The exact Facebook algorithms responsible for post positioning on the main page are classified and can only be surmised.
You can find my findings in the graphic below.
Every publicly available source I use is referenced here.
There are many factors that influence post popularity. I believe most of them are publicly known, but there are likely some yet to be discovered by the public.
The exact number of active Facebook users is unknown; however, estimates indicate around 2.9 billion monthly active users (Q4 2021).
"The platform surpassed two billion active users in the second quarter of 2017, taking just over 13 years to reach this milestone. In comparison, Meta-owned Instagram took 11.2 years, and Google’s YouTube took just over 14 years to achieve this landmark. As of October 2021, Facebook’s leading audience base was in India, with almost 350 million users whilst the United States ranked second with an approximate total of 193 million users. The platform also finds remarkable popularity in Indonesia and Brazil, with well over 100 million users in both countries."
- According to Statista.com
The screenshots above, whose source I reference here, contain a lot of useful information for my project.
Over three-quarters of Facebook users use the site daily, which results in huge engagement on profiles every day.
What is interesting, and gives high hopes, is that sharing with many people at once is the top reason why men use Facebook and the second reason for women.
Let's look at the age of the users.
The interactive chart which I reference here shows that until 2015 the 18-29 group had the highest percentage of social media usage. From 2015 the 30-49 age group joined them, and from 2017/2018 the values for all three age groups are in close proximity. This leads to the conclusion that each year social media and technology become more available to older people.
However, there is one factor that inflates this effect.
People from the 18-29 age group who were at the 'peak' in 2009/2010 are, ten years later (2019/2020), still using social media but now fall into the older 30-49 age group. The same applies to the older age groups.
Facebook's original intention was to be a social network for college students; at one time it even required an .edu email address for registration.
Nowadays, Facebook business profiles are one of the most effective marketing tools for their owners and help reach many more clients at a very decent price compared to the results.
According to Statista.com, 77% of Internet users are active on at least one Meta platform (Facebook is owned by Meta and is their largest platform), giving business owners an incredible opportunity to reach an enormous pool of potential new clients, all on one platform.
The company makes sure that setting up a business profile and creating a firm's image is easy, convenient and free of charge for its users, making Facebook's services a no-brainer in the marketing niche.
The graph below (source: Statista.com) shows that Facebook users spend on average 33 minutes a day on the platform, which easily makes the 'marketer math':
4.1.5 Facebook pages and engagement
A Facebook page enables businesses, brands, celebrities, initiatives and organizations to reach their audience free of charge. Facebook profiles can be private, while pages are public. Google can index a page, which makes it easier to find. You can operate your Facebook page, and platforms such as Facebook Business Suite and Creator Studio, on your desktop and mobile device.
A Facebook page allows its owner to promote their company and keep in touch with users. The engagement indicator allows them to see how many people were influenced by ads on the page and its posts. Thanks to this, the owner can assess how well the ads match the audience. Page activity takes into account interactions with the Facebook page and its posts influenced by ads. On-page activity can include things like liking, marking a post with a 'Super' reaction, checking in to a location, clicking a link, and so on.
Advantages of pages:
The most popular page functions:
Those are the questions many wish to have an answer to.
History continues to amaze us with its irony. The fact is that although Facebook has been successful in making the lives of billions of people public, user privacy is its future.
"The sources of change in the way people communicate are instant messaging, small communities and ephemeral content," said Mark Zuckerberg to shareholders during the last meeting on the periodic discussion of financial results for the fourth quarter of 2019.
For this reason, WhatsApp, Instagram and Messenger are becoming the main driving force behind the development of Facebook. The best evidence is the company's performance data and a look at which way money is starting to flow from the advertisers themselves.
Looking at potential new sources from which Facebook can derive new and greater profits. Even though many of us still log on to Facebook every day, the iconic blue news feed isn't as eye-catching as it used to be. The attention of users, especially the young ones, is shifting to Stories. Research shows that this Instagram format generates 15 times more engagement than any other place in the Facebook ecosystem.
This is also confirmed by marketing budgets. About 98% of Facebook's advertising revenue in the last quarter came from mobile devices. This should come as no surprise to anyone. GlobalWebIndex data shows that mobile phone traffic currently accounts for more than half of the time we spend online, and we spend half of our time there on social media.
It's hard to say exactly what the future of Facebook will be like, but the company seems to be more fortunate than smart. Privacy has become a valuable commodity. Now the company can start making money on the fact that it gives us a substitute for what it has taken away from us to a significant extent. However, there is something else worth paying attention to.
Over the last few years, Facebook has tried to impose its vision of the Internet on its users, which resulted in a drop in engagement. Recently, there has been a retreat from such activities and the company is trying to follow people, an example of which is the development of groups that came from users. Business will benefit if it follows the same path, i.e. puts people at the center of its attention. We all benefit from this.
4.1.7 Interview with domain expert
To further investigate the domain, I will conduct an interview with an expert in this field and present my findings here. Beforehand, I am going to prepare the questions and topics I wish to cover during the meeting and plan how to moderate the discussion.
Interview with Natalia Nadolska & Karolina Dlubek, marketing specialists at Digital Care group.
4.1.8 What impact has this project on society and people?
Next to the value that this project brings to the client, there is also the impact it has on society and people.
Besides scientific use, this project is of no interest to individual people and has no direct impact on them. Of course, it may be used as a learning or research resource.
However, it might impact society indirectly.
As Facebook and Facebook marketing target selected groups in society, using this project's deliverable for those purposes will affect them. For now I cannot think of any negative impact this project could have, as it is intended mostly for research purposes. This niche is unlikely to affect society in a harmful way, not physically at least. However, it is all about the usage: fake news and hate speech are among the many negative outcomes of social media abuse and, despite being quite extreme situations, I have to take them into account.
The second phase of the project is focused on data.
It starts with the more theoretical aspects, then moves into data understanding and preparation for modelling.
After Phase 1, I already know which domains I need the data from.
Domain 2 is related to Domain 1, as it is a sub-domain. The Facebook domain is especially needed to fully understand the main one for modelling and EDA.
Storing this kind of data is beneficial for many stakeholders in and outside those domains.
Facebook is one of the largest marketing playgrounds and gathers many companies interested in advertising there. Also, page administrators and owners want to have
insights into their products' performance. On top of that, Facebook aims to deliver its customers the highest value and the best experience in using the platform.
To do so, they need to constantly improve their services and, to do that, they first need to measure them.
Data required for the model consists of 2 facts and 1 dimension.
Target_Post_Share_Count and Page_Popularity being facts and Wday_Publishing as a time-related feature being a dimension.
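To make this concrete, a minimal sketch of the required table shape could look as follows (the values shown are made up for illustration):

```python
import pandas as pd

# A minimal, hypothetical illustration of the required table shape:
# two facts and one time-related dimension per post (values are made up).
required = pd.DataFrame({
    'Page_Popularity': [36734, 292911],          # fact: like count of the mother page
    'Wday_Publishing': ['Sunday', 'Thursday'],   # dimension: weekday of publishing
    'Target_Post_Share_Count': [13, 61],         # fact: the value to predict
})
print(required)
```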
Candidate data sources should be identified based on the facts.
The candidate data sources containing the identified data elements should be reviewed with respect to the data facts they collect. If these facts make sense with regard to the information they contain, I may continue by focusing on the dimensions.
I have already listed several candidate data sources here.
In order to start working with data, I need to be able to understand it. Data is usually presented in various forms:
initially it is often presented in tabular form, but to get a better view of the data it is common to visualize it in graphs, such as a line graph, histogram or pie chart.
This part answers the following questions:
import numpy as np
# Numerical Python, is a library consisting of multidimensional array objects
# and a collection of routines for processing those arrays.
# Using NumPy, mathematical and logical operations on arrays can be performed.
import pandas as pd
# library for data manipulation and analysis. In particular, it offers
# data structures and operations for manipulating numerical tables and time series.
import matplotlib
import matplotlib.pyplot as plt
# plotting library for Python and its numerical mathematics extension NumPy.
# It provides an object-oriented API for embedding plots into applications
# using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK.
import seaborn as sns
# data visualization library based on matplotlib. It provides a high-level interface
# for drawing attractive and informative statistical graphics.
# sns - Samuel Norman Seaborn, a fictional character from the serial drama The West Wing
%matplotlib inline
# sets the backend of matplotlib to the 'inline' backend: With this backend,
# the output of plotting commands is displayed inline within frontends like
# the Jupyter notebook, directly below the code cell that produced it.
# The resulting plots will then also be stored in the notebook document.
import sklearn as sk
# free software machine learning library for the Python programming language.
# It features various classification, regression and clustering algorithms.
from IPython.display import display
from sklearn.preprocessing import StandardScaler
# StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance.
from sklearn.model_selection import train_test_split
# a function in Sklearn model selection for splitting data arrays into two subsets: for training data and for testing data.
from sklearn.svm import SVC
# (Support Vector Classifier) is to fit to the data you provide, returning a "best fit" hyperplane that divides, or categorizes data.
from sklearn import metrics
# metrics module implements several loss, score, and utility functions to measure classification performance
from sklearn.neighbors import DistanceMetric
# DistanceMetric provides the distance metrics used by nearest-neighbour estimators.
print('numpy version:', np.__version__)
print('pandas version:', pd.__version__)
print('matplotlib version:', matplotlib.__version__)
print('seaborn version:', sns.__version__)
print('scikit-learn version:', sk.__version__)
numpy version: 1.20.3 pandas version: 1.3.4 matplotlib version: 3.4.3 seaborn version: 0.11.2 scikit-learn version: 1.0.2
facebook_df = pd.read_csv("/Users/andrew/Desktop/PersonalChallenge/Dataset/Training/Features_Variant_1.csv",sep=',',decimal = ",", header=None)
# loading the facebook dataset into environment
facebook_df = pd.DataFrame(facebook_df)
# creating a dataframe
features_df = pd.read_csv("/Users/andrew/Desktop/PersonalChallenge/Dataset/FeatureNames.csv",skiprows=9,sep=',')
# loading the facebook dataset feature names into environment
Assigning and changing column names
print(len(facebook_df.columns))
print(len(features_df))
# calculating the number of columns to check if it matches the dataset description
54
55
Apparently there is 1 NaN row at the end that I have to get rid of.
names = features_df['feature']
# creating a list of names
names = names.dropna()
# dropping the last NaN entry
facebook_df.columns = names
# assigning meaningful names
facebook_df = facebook_df.rename(columns = {
facebook_df.columns[0]:'Page_Popularity',
facebook_df.columns[1]:'Page_Checkins',
facebook_df.columns[2]:'Page_Talking_About',
facebook_df.columns[3]:'Page_Category',
facebook_df.columns[29]:'CC1',
facebook_df.columns[30]:'CC2',
facebook_df.columns[31]:'CC3',
facebook_df.columns[32]:'CC4',
facebook_df.columns[33]:'CC5',
facebook_df.columns[34]:'Base_time',
facebook_df.columns[35]:'Post_Lenght',
facebook_df.columns[36]:'Target_Post_Share_Count',
facebook_df.columns[37]:'Post_Promotion_Status',
facebook_df.columns[38]:'H_Local',
facebook_df.columns[52]:'Base DateTime Weekday 53',
facebook_df.columns[53]:'Nr_Comments',
facebook_df.columns[39]:'Post Published Sunday',
facebook_df.columns[40]:'Post Published Monday',
facebook_df.columns[41]:'Post Published Tuesday',
facebook_df.columns[42]:'Post Published Wednesday',
facebook_df.columns[43]:'Post Published Thursday',
facebook_df.columns[44]:'Post Published Friday',
facebook_df.columns[45]:'Post Published Saturday',
})
# Fixing some typos in feature names
facebook_df.columns
Index(['Page_Popularity', 'Page_Checkins', 'Page_Talking_About',
'Page_Category', 'CC1 Min', 'CC1 Max', 'CC1 Avg', 'CC1 Median',
'CC1 Std', 'CC2 Min', 'CC2 Max', 'CC2 Avg', 'CC2 Median', 'CC2 Std',
'CC3 Min', 'CC3 Max', 'CC3 Avg', 'CC3 Median', 'CC3 Std', 'CC4 Min',
'CC4 Max', 'CC4 Avg', 'CC4 Median', 'CC4 Std', 'CC5 Min', 'CC5 Max',
'CC5 Avg', 'CC5 Median', 'CC5 Std', 'CC1', 'CC2', 'CC3', 'CC4', 'CC5',
'Base_time', 'Post_Lenght', 'Target_Post_Share_Count',
'Post_Promotion_Status', 'H_Local', 'Post Published Sunday',
'Post Published Monday', 'Post Published Tuesday',
'Post Published Wednesday', 'Post Published Thursday',
'Post Published Friday', 'Post Published Saturday',
'Base DateTime Weekday 47', 'Base DateTime Weekday 48',
'Base DateTime Weekday 49', 'Base DateTime Weekday 50',
'Base DateTime Weekday 51', 'Base DateTime Weekday 52',
'Base DateTime Weekday 53', 'Nr_Comments'],
dtype='object', name='feature')
5.3.3 Explaining column names
Click to expand the column names explanation.
1 Page_Popularity (likes) - Defines the popularity of, or support for, the source of the document.
2 Page_Checkins - Describes how many individuals have visited this place so far. This feature is only associated with places, e.g. an institution, venue, theater, etc.
3 Page_Talking_About - Defines the daily interest of individuals in the source of the document/post, i.e. the people who actually come back to the page after liking it. This includes activities such as comments, likes on a post, shares, etc. by visitors to the page.
4 Page_Category - Defines the category of the source of the document, e.g. place, institution, brand, etc.
5 - 29 These features are aggregated by page, by calculating the min, max, average, median and standard deviation of the essential features.
30 CC1 - The total number of comments before the selected base date/time.
31 CC2 - The number of comments in the last 24 hours, relative to the base date/time.
32 CC3 - The number of comments between 48 and 24 hours before the base date/time.
33 CC4 - The number of comments in the first 24 hours after the publication of the post, but before the base date/time.
34 CC5 - The difference between CC2 and CC3.
35 Base time - Decimal (0-71) encoding; the time selected in order to simulate the scenario.
36 Post length - Character count of the post.
37 Target_Post_Share_Count - The number of shares of the post, i.e. how many people have shared this post on their timeline.
38 Post Promotion Status - To reach more people with posts in the News Feed, individuals can promote their post; this feature tells whether the post is promoted (1) or not (0).
39 H_Local - Decimal (0-23) encoding; describes the H hours for which we have the target variable/comments received.
40 - 46 Post published weekday - Represents the day (Sunday...Saturday) on which the post was published.
47 - 53 Weekday feature - Represents the day (Sunday...Saturday) of the selected base date/time.
54 Nr_Comments - The number of comments in the next H hours (H is given in feature no. 39).
General information about the database
facebook_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 40949 entries, 0 to 40948 Data columns (total 54 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Page_Popularity 40949 non-null int64 1 Page_Checkins 40949 non-null int64 2 Page_Talking_About 40949 non-null int64 3 Page_Category 40949 non-null int64 4 CC1 Min 40949 non-null object 5 CC1 Max 40949 non-null object 6 CC1 Avg 40949 non-null object 7 CC1 Median 40949 non-null object 8 CC1 Std 40949 non-null object 9 CC2 Min 40949 non-null object 10 CC2 Max 40949 non-null object 11 CC2 Avg 40949 non-null object 12 CC2 Median 40949 non-null object 13 CC2 Std 40949 non-null object 14 CC3 Min 40949 non-null object 15 CC3 Max 40949 non-null object 16 CC3 Avg 40949 non-null object 17 CC3 Median 40949 non-null object 18 CC3 Std 40949 non-null object 19 CC4 Min 40949 non-null object 20 CC4 Max 40949 non-null object 21 CC4 Avg 40949 non-null object 22 CC4 Median 40949 non-null object 23 CC4 Std 40949 non-null object 24 CC5 Min 40949 non-null object 25 CC5 Max 40949 non-null object 26 CC5 Avg 40949 non-null object 27 CC5 Median 40949 non-null object 28 CC5 Std 40949 non-null object 29 CC1 40949 non-null int64 30 CC2 40949 non-null int64 31 CC3 40949 non-null int64 32 CC4 40949 non-null int64 33 CC5 40949 non-null int64 34 Base_time 40949 non-null int64 35 Post_Lenght 40949 non-null int64 36 Target_Post_Share_Count 40949 non-null int64 37 Post_Promotion_Status 40949 non-null int64 38 H_Local 40949 non-null int64 39 Post Published Sunday 40949 non-null int64 40 Post Published Monday 40949 non-null int64 41 Post Published Tuesday 40949 non-null int64 42 Post Published Wednesday 40949 non-null int64 43 Post Published Thursday 40949 non-null int64 44 Post Published Friday 40949 non-null int64 45 Post Published Saturday 40949 non-null int64 46 Base DateTime Weekday 47 40949 non-null int64 47 Base DateTime Weekday 48 40949 non-null int64 48 Base DateTime Weekday 49 40949 non-null int64 49 Base DateTime Weekday 50 40949 non-null int64 50 Base DateTime Weekday 51 40949 non-null int64 51 Base DateTime Weekday 52 40949 non-null int64 52 Base DateTime Weekday 53 40949 non-null int64 53 Nr_Comments 40949 non-null int64 dtypes: int64(29), object(25) memory usage: 16.9+ MB
The .info() function gives a fast and short overview of the dataframe.
There are 40949 data rows overall, which gives high hopes for the model - a lot of data to train on.
The dataset contains a lot of features, some more needed in this project than others.
At first glance, I see no missing values, which is a good sign.
5.3.4 Computing Summary Statistics
Why? : Computing Summary Statistics is an essential step at the beginning of every data analysis and gives a lot of initial insight into the data I am working with. Already at this stage I learn about many of the computed values (explained below) and can draw conclusions about the dataset.
facebook_df[['Page_Popularity','Page_Checkins','Page_Talking_About','Page_Category','Post_Lenght','Target_Post_Share_Count','Nr_Comments']].describe()
# selecting meaningful features to compute statistics
| feature | Page_Popularity | Page_Checkins | Page_Talking_About | Page_Category | Post_Lenght | Target_Post_Share_Count | Nr_Comments |
|---|---|---|---|---|---|---|---|
| count | 4.094900e+04 | 40949.000000 | 4.094900e+04 | 40949.000000 | 40949.000000 | 40949.000000 | 40949.000000 |
| mean | 1.313814e+06 | 4676.133752 | 4.480025e+04 | 24.254780 | 163.652470 | 117.249823 | 7.322889 |
| std | 6.785752e+06 | 20593.184863 | 1.109338e+05 | 19.950583 | 376.264387 | 945.006667 | 35.494550 |
| min | 3.600000e+01 | 0.000000 | 0.000000e+00 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
| 25% | 3.673400e+04 | 0.000000 | 6.980000e+02 | 9.000000 | 38.000000 | 2.000000 | 0.000000 |
| 50% | 2.929110e+05 | 0.000000 | 7.045000e+03 | 18.000000 | 97.000000 | 13.000000 | 0.000000 |
| 75% | 1.204214e+06 | 99.000000 | 5.026400e+04 | 32.000000 | 172.000000 | 61.000000 | 3.000000 |
| max | 4.869723e+08 | 186370.000000 | 6.089942e+06 | 106.000000 | 21480.000000 | 144860.000000 | 1305.000000 |
Page_Popularity (likes) - a total of 40949 values ranging from 36 to 486,972,297, with a mean of 1,313,814; 25% of observations are below 36,734 likes.
Note that a page's like number may appear more than once in the dataset, as it sometimes contains data from more than one post on the same page.
Page_Checkins - the total number of values is the same, with a min of 0 and a max of 186,370, as this feature is only applicable to pages of places in which users may check in.
Page_Talking_About - a mean of 44,800 indicates that, on average, people do return to a previously liked page and are perhaps interested in its content.
Page_Category - a categorical variable indicating the category of the page.
Post_Length - on average a post's length equals 163 characters, with a min of 0 (perhaps a photo) and a max of 21,480, which is a very long post.
Target_Post_Share_Count - a lot of values for the target variable, with a mean of 117, which is relatively high, a min of 1 and a max of 144,860, which is enormous.
However, the 75th percentile is 61, which indicates that the values are rather closer to the min than to the max.
What worries me is the high standard deviation, which is over 8 times higher than the mean.
Nr_Comments - the mean number of comments is rather low, taking into account the enormous page like counts and share counts.
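As a follow-up on that concern, a quick sketch using pandas' skew() can quantify how heavily these variables are skewed (a large positive value confirms that a few huge observations dominate):

```python
# A quick check (sketch) of the skew that the summary statistics above suggest.
print('Target_Post_Share_Count skew:', facebook_df['Target_Post_Share_Count'].skew())
print('Page_Popularity skew:        ', facebook_df['Page_Popularity'].skew())
```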
5.3.5 Visualizing Correlation on a heat map
Why? : A heat map is a convenient way to visualize the correlation of selected variables from the dataset. The output is a clear overview shown by colour density. This is a rather early stage, but it can already indicate the direction of model accuracy. I use it to check the correlation not only with my target variable but also to research some other dependencies.
cols = ['Page_Popularity','Page_Checkins','Page_Talking_About','Page_Category','Post_Lenght','Target_Post_Share_Count','Nr_Comments']
# choosing columns to display
cm = np.corrcoef(facebook_df[cols].values, rowvar=0)
# calculating the correlation coefficient matrix
plt.figure(figsize=(8, 8));
sns.set(font_scale=1.2)
# adjusting font scale
hm = sns.heatmap(cm,
cbar=True, # bar on the right
annot=True, # annotating values
square=True, # making the plot square
fmt='.2f', # nr of decimals
annot_kws={'size': 15}, # size of decimals
yticklabels=cols, # setting labels
xticklabels=cols)
plt.show() # printing the plot
The graphic above gives me a lot of insight into the correlation of features.
The correlation between the target variable and the likes feature equals 0.33, which is a weak positive correlation. It is caused by the high standard deviation
of both features and many big outliers.
In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data.
In the broadest sense correlation is any statistical association, though it commonly refers to the degree to which a pair of variables are linearly related.
Standard deviation is a measure of the amount of variation or dispersion of a set of values.
A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set,
while a high standard deviation indicates that the values are spread out over a wider range.
Perhaps, after removing the outliers and normalizing the data the correlation will improve.
For now I will continue with EDA by plotting the correlation.
sns.set(style='whitegrid', context='notebook')
sns.pairplot(facebook_df, x_vars=['Page_Popularity','Page_Checkins','Page_Talking_About','Page_Category','Post_Lenght','Nr_Comments'], y_vars=['Target_Post_Share_Count'], height=3);
plt.show()
All the values are really big, and to get any meaningful insights I am going to have to rescale them.
The two most discussed scaling methods are normalization and standardization. Normalization typically means rescaling the values into a range of [0, 1]. Standardization typically means rescaling the data to have a mean of 0 and a standard deviation of 1 (unit variance).
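To make the difference concrete, here is a small sketch (with made-up numbers) contrasting scikit-learn's MinMaxScaler (normalization) with StandardScaler (standardization):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy column with one large value, just to illustrate the two scalers.
values = np.array([[1.0], [2.0], [3.0], [100.0]])

print(MinMaxScaler().fit_transform(values).ravel())    # rescaled into [0, 1]
print(StandardScaler().fit_transform(values).ravel())  # mean 0, unit variance
```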
Why? : "Data standardization is about making sure that data is internally consistent; that is, each data type has the same content and format. Standardized values are useful for tracking data that isn't easy to compare otherwise." I want to compare data that is is many different scales and would be useless to compare. That is why I standardize it.
from sklearn.preprocessing import StandardScaler
from pandas import DataFrame
# trying out the standardization
fb_normalized = facebook_df[['Page_Popularity','Page_Checkins','Page_Talking_About','Page_Category','Post_Lenght','Target_Post_Share_Count','Nr_Comments']]
# creating a working sub dataframe to check correlation on normalized data
names = ['Page_Popularity','Page_Checkins','Page_Talking_About','Page_Category','Post_Lenght','Target_Post_Share_Count','Nr_Comments']
# choosing names for normalized columns, same as previous
scaler = StandardScaler()
# defining scaler
fb_normalized = scaler.fit_transform(fb_normalized)
# normalizing the data
fb_normalized = DataFrame(fb_normalized)
# converting normalized data into pandas dataframe
fb_normalized = fb_normalized.set_axis(names, axis=1)
# changing column names
fb_normalized
| Page_Popularity | Page_Checkins | Page_Talking_About | Page_Category | Post_Lenght | Target_Post_Share_Count | Nr_Comments | |
|---|---|---|---|---|---|---|---|
| 0 | -0.100037 | -0.227075 | -0.399678 | -1.165633 | 0.006239 | -0.121958 | -0.206313 |
| 1 | -0.100037 | -0.227075 | -0.399678 | -1.165633 | -0.084124 | -0.123016 | -0.206313 |
| 2 | -0.100037 | -0.227075 | -0.399678 | -1.165633 | -0.081466 | -0.121958 | -0.206313 |
| 3 | -0.100037 | -0.227075 | -0.399678 | -1.165633 | -0.086782 | -0.123016 | -0.206313 |
| 4 | -0.100037 | -0.227075 | -0.399678 | -1.165633 | -0.057547 | -0.118783 | -0.206313 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 40944 | 0.863039 | -0.223675 | 4.076353 | -0.764638 | -0.403053 | 1.474876 | -0.178139 |
| 40945 | 0.863039 | -0.223675 | 4.076353 | -0.764638 | -0.038942 | 1.038894 | -0.149965 |
| 40946 | 0.863039 | -0.223675 | 4.076353 | -0.764638 | -0.116017 | 3.010333 | 1.822192 |
| 40947 | 0.863039 | -0.223675 | 4.076353 | -0.764638 | -0.347240 | 1.339425 | 0.582550 |
| 40948 | 0.863039 | -0.223675 | 4.076353 | -0.764638 | -0.081466 | 3.825152 | 0.103598 |
40949 rows × 7 columns
After the standardization I want to check the correlation once again, first on the heat map and then on a regular scatterplot.
Note: This is just a try-out and I am almost sure this will not change anything on the heatmap.
cols = ['Page_Popularity','Page_Checkins','Page_Talking_About','Page_Category','Post_Lenght','Target_Post_Share_Count','Nr_Comments']
# choosing columns to display
cm = np.corrcoef(fb_normalized[cols].values, rowvar=0)
# calculating the correlation coefficient matrix
plt.figure(figsize=(8, 8));
sns.set(font_scale=1.2)
# adjusting font scale
hm = sns.heatmap(cm,
cbar=True, # bar on the right
annot=True, # annotating values
square=True, # making the plot square
fmt='.2f', # nr of decimals
annot_kws={'size': 15}, # size of decimals
yticklabels=cols, # setting labels
xticklabels=cols)
plt.show() # printing the plot
As I thought, no change here.
In a later phase I will remove the outliers and check the correlation once again.
I want to look at the heatmap one more time; maybe I will find some better correlations.
#'Post Published Sunday','Post Published Monday','Post Published Tuesday','Post Published Wednesday','Post Published Thursday','Post Published Friday','Post Published Saturday'
cols = ['Page_Popularity','Target_Post_Share_Count','Page_Talking_About','Nr_Comments']
cm = np.corrcoef(facebook_df[cols].values, rowvar=0)
sns.set(font_scale=1.5)
hm = sns.heatmap(cm,
cbar=True,
annot=True,
square=True,
fmt='.2f',
annot_kws={'size': 15},
yticklabels=cols,
xticklabels=cols)
plt.show()
There is a high correlation between 'Page_Popularity' and 'Page_Talking_About'; I might want to consider and check it later. Let's look at the scatterplot.
sns.set(style='whitegrid', context='notebook')
sns.pairplot(fb_normalized, x_vars=['Page_Popularity','Target_Post_Share_Count','Nr_Comments'], y_vars=['Page_Talking_About'], height=3);
plt.show()
Similar to the initial plot. Standardizing does not change the distribution, only the scale of the numbers. That is why the plots look identical, just with different scales.
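A quick way to verify this (a sketch on synthetic data, not on the dataset itself) is to check that the Pearson correlation is unchanged by standardization:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = 3 * a + rng.normal(size=1000)

# Standardize both variables by hand and compare the correlation coefficients.
a_std = (a - a.mean()) / a.std()
b_std = (b - b.mean()) / b.std()
print(np.corrcoef(a, b)[0, 1], np.corrcoef(a_std, b_std)[0, 1])  # identical up to rounding
```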
# Function to color the bar charts so similar values has similar colors
def colors_from_values(values, palette_name):
# normalize the values to range [0, 1]
normalized = (values - min(values)) / (max(values) - min(values))
# convert to indices
indices = np.round(normalized * (len(values) - 1)).astype(np.int32)
# use the indices to get the colors
palette = sns.color_palette(palette_name, len(values))
return np.array(palette).take(indices, axis=0)
x = np.array(['Product/service','Public figure','Retail and consumer merchandise','Athlete','Education website','Arts/entertainment/nightlife','Aerospace/defense',
'Actor/director','Professional sports team','Travel/leisure','Arts/humanities website','Food/beverages','Record label','Movie','Song','Community',
'Company','Artist','Non-governmental organization (ngo)','Media/news/publishing'])
y = np.array(facebook_df.groupby(['Page_Category']).size().sort_values(ascending=False).head(20))
plt.figure(figsize=(25, 6));
plot = sns.barplot(x=x,y=y, palette=colors_from_values(y, "light:#114769"))
plt.xlabel("Page category", size=14)
plt.ylabel("Total number of pages", size=14)
plt.title("Total number of pages per category (top 20 categories)", size=19)
plt.bar_label(plot.containers[0],size=16)
plt.xticks(rotation = 80, fontsize=15)
plt.show()
The bar chart above illustrates the distribution of page categories. I loaded the category names into the environment from the dataset description.
I can see that Product/service pages are the most frequent in the data, with 7494 pages. Public figures and retail follow very closely, still on the podium, with 4511 and 4301 registered pages.
plt.figure(figsize=(12, 14));
plt.subplot(2,1,1); #the figure has 2 rows, 1 column, and this plot is the first plot.
x = np.array(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday','Saturday', 'Sunday'])
y = np.array(facebook_df.groupby(['Wday_Nr','Wday_Publishing'])['Target_Post_Share_Count'].mean())
plot = sns.barplot(x=x,y=y, palette=colors_from_values(y, "light:#A9C9D9"))
plt.ylabel("Average number", size=14)
plt.title("Average number of re-shares that posts receive over weekdays", size=19)
plt.bar_label(plot.containers[0],size=16,label_type='center')
plt.subplot(2,1,2); #the figure has 2 rows, 1 column, and this plot is the second plot.
a = np.array(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday','Saturday', 'Sunday'])
b = np.array(facebook_df.groupby(['Wday_Nr','Wday_Publishing']).size())
plot = sns.barplot(x=a,y=b, palette=colors_from_values(y, "light:#114769"))
plt.xlabel("Weekday", size=14)
plt.ylabel("Count", size=14)
plt.title("Total number of posts shared by motherpages", size=19)
plt.bar_label(plot.containers[0],size=16,label_type='center')
plt.show()
The bar chart above aims to visualize the tendency of posts to get re-shared by users. From the top plot, I can see that Thursdays and Sundays are the weekdays with the highest rate of sharing by users. To confirm this, I compare this graph with the total number of posts shared by pages, as the high mean might be caused by a high number of 'initial' shares. However, the bottom plot proves the opposite: apparently Thursdays and especially Sundays are the days with the smallest number of new posts.
Perhaps on those days, with a relatively low amount of new information, users get a chance to focus their attention on what is available and are more likely to pass it forward.
plt.figure(figsize=(12, 14));
plt.subplot(2,1,1); #the figure has 2 rows, 1 column, and this plot is the first plot.
x = np.array(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday','Saturday', 'Sunday'])
y = np.array(facebook_df.groupby(['Wday_Nr','Wday_Publishing'])['CC4'].mean())
plot = sns.barplot(x=x,y=y, palette=colors_from_values(y, "light:#114769"))
plt.ylabel("Average nr.", size=14)
plt.title("Average number of comments under a post in the first 24h after publishing", size=19)
plt.bar_label(plot.containers[0],size=16,label_type='center')
plt.subplot(2,1,2); #the figure has 2 rows, 1 column, and this plot is the second plot.
a = np.array(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday','Saturday', 'Sunday'])
b = np.array(facebook_df.groupby(['Wday_Nr','Wday_Publishing'])['CC4'].sum())
plot = sns.barplot(x=a,y=b, palette=colors_from_values(y, "light:#114769"))
plt.xlabel("Weekday", size=14)
plt.ylabel("Count", size=14)
plt.title("Total number of comments under a post in the first 24h after publishing", size=19)
plt.bar_label(plot.containers[0],size=16,label_type='center')
plt.show()
Similarly to the previous visualization, I want to discover the number of comments that appear under a post, in this case in the 24 hours after publishing. The mean values are close to each other this time, with a slight advantage for Wednesday and, again, Sunday. I double-check the result by plotting the total number of comments grouped by weekday, to make sure my conclusion is valid.
What is interesting is that Sunday has both the highest mean number of comments written and the lowest total number of them, which indicates that the mean is not driven by a huge number of observations.
Linking the observations from the previous visualization and this one, I can certainly see a connection between the high re-share number and the big number of comments on Sunday. Perhaps a post with a lot of comments is promoted by Facebook's algorithms and reaches more users, which accelerates this process. However, it could also work the other way around, where a lot of people re-share the post and that is the reason for the outstanding number of comments. Nevertheless, both features influence or correlate with each other.
plt.figure(figsize=(12, 14));
plt.subplot(2,1,1); #the figure has 2 rows, 1 column, and this plot is the first plot.
x = np.array(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday','Saturday', 'Sunday'])
y = np.array(facebook_df.groupby(['Wday_Nr','Wday_Publishing'])['Page_Talking_About'].mean())
plot = sns.barplot(x=x,y=y, palette=colors_from_values(y, "light:#114769"))
plt.ylabel("Average nr.", size=14)
plt.title("Average number of page Talking_About" , size=19)
plt.bar_label(plot.containers[0],size=16,label_type='center')
plt.subplot(2,1,2);
a = np.array(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday','Saturday', 'Sunday'])
b = np.array(facebook_df.groupby(['Wday_Nr','Wday_Publishing'])['Page_Talking_About'].sum())
plot = sns.barplot(x=a,y=b, palette=colors_from_values(y, "light:#114769"))
plt.ylabel("Sum", size=14)
plt.title("Total number of page Talking_About" , size=19)
plt.bar_label(plot.containers[0],size=10,label_type='center')
plt.xlabel("Weekday", size=14)
plt.show()
Page_Talking_About defines the daily interest of individuals in the source of the document/post, i.e. people who actually come back to the page after liking it. This includes activities such as comments, likes on a post, shares, etc. by visitors to the page.
The bar chart above indicates that Wednesday has the most engagement throughout the week, similarly to the number of comments and new posts being shared.
However, this time it is most likely caused by the high number of total engagements recorded in my data. For this reason I want to check the median value as well.
plt.figure(figsize=(12, 7));
x = np.array(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday','Saturday', 'Sunday'])
y = np.array(facebook_df.groupby(['Wday_Nr','Wday_Publishing'])['Page_Talking_About'].median())
plot = sns.barplot(x=x,y=y, palette=colors_from_values(y, "light:#114769"))
plt.ylabel("Median", size=14)
plt.title("Median number of page Talking_About" , size=19)
plt.bar_label(plot.containers[0],size=16,label_type='center')
plt.xlabel("Weekday", size=14)
plt.show()
And again, SUNDAY! The median (10938) is almost 4 times smaller than the mean (47140), which tells me that the data is heavily skewed.
The median represents the 50th percentile of a dataset: exactly half of the values in the dataset are larger than the median and half of the values are lower.
It is an important metric to calculate because it gives an idea of where the "center" of a dataset is located, and of the "typical" value in a given dataset.
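A tiny illustration (with made-up numbers) of why the mean and the median diverge so strongly on skewed data:

```python
import numpy as np

# One extreme value is enough to drag the mean far above the median.
values = np.array([1, 2, 3, 4, 5, 1000])
print('mean:  ', values.mean())       # ~169.2
print('median:', np.median(values))   # 3.5
```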
Why? : I have already made some adjustments to the initial dataset just to be able to start the project, and I think the data is pretty clean. However, before moving to the modelling phase I need to make sure some essential conditions are in place. Data preparation ensures analysts trust, understand, and ask better questions of their data, making their analyses more accurate and meaningful. From more meaningful data analysis come better insights and, of course, better outcomes.
Check for duplicated values
facebook_df.duplicated().value_counts()
False    40941
True         8
dtype: int64
There are 8 duplicated rows in the data overall, which is a small number. Their influence on the whole dataset is negligible.
However, I will get rid of the duplicates anyway.
facebook_df = facebook_df.drop_duplicates()
print(facebook_df.duplicated().value_counts())
False 40941 dtype: int64
Check for missing values
In order to avoid bias I need to handle all null values in the dataset. I start by counting the null values in each column.
facebook_df[facebook_df.columns[facebook_df.isnull().any()]].isnull().sum()
# select columns with count which have at least 1 null value
Series([], dtype: float64)
Fortunately, the dataset is complete and does not require any data to be filled up.
Aggregate weekday of publishing into one column
Now, the weekday of publishing is split into 7 columns with 0 and 1 indicators. To use this feature as a class for classification I need to merge it into one column.
# to do this I use the map function, specifying the column and the value for which I want to set the day
facebook_df['Wday_Publishing'] = facebook_df['Post Published Sunday'].map({1: 'Sunday'})
facebook_df['Wday_Publishing'] = facebook_df['Wday_Publishing'].fillna(facebook_df['Post Published Monday'].map({1: 'Monday'}))
facebook_df['Wday_Publishing'] = facebook_df['Wday_Publishing'].fillna(facebook_df['Post Published Tuesday'].map({1: 'Tuesday'}))
facebook_df['Wday_Publishing'] = facebook_df['Wday_Publishing'].fillna(facebook_df['Post Published Wednesday'].map({1: 'Wednesday'}))
facebook_df['Wday_Publishing'] = facebook_df['Wday_Publishing'].fillna(facebook_df['Post Published Thursday'].map({1: 'Thursday'}))
facebook_df['Wday_Publishing'] = facebook_df['Wday_Publishing'].fillna(facebook_df['Post Published Friday'].map({1: 'Friday'}))
facebook_df['Wday_Publishing'] = facebook_df['Wday_Publishing'].fillna(facebook_df['Post Published Saturday'].map({1: 'Saturday'}))
facebook_df['Wday_Nr'] = facebook_df['Wday_Publishing'].map({'Monday':1, 'Tuesday':2, 'Wednesday':3, 'Thursday':4, 'Friday':5, 'Saturday':6, 'Sunday':7})
# Now transfer the weekdays to numbers
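A more compact equivalent is sketched below; it assumes, as the data suggests, that exactly one of the seven indicator columns equals 1 in every row:

```python
# Hypothetical compact variant: pick the indicator column with the maximum
# value per row and strip the 'Post Published ' prefix to get the weekday name.
day_cols = ['Post Published Sunday', 'Post Published Monday', 'Post Published Tuesday',
            'Post Published Wednesday', 'Post Published Thursday', 'Post Published Friday',
            'Post Published Saturday']
wday_alt = facebook_df[day_cols].idxmax(axis=1).str.replace('Post Published ', '', regex=False)
# wday_alt should match facebook_df['Wday_Publishing'] created above.
```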
Check data types
facebook_df[['Page_Popularity','Page_Checkins','Page_Talking_About','Page_Category','Post_Lenght','Target_Post_Share_Count','Nr_Comments', 'Wday_Publishing']].dtypes
feature
Page_Popularity             int64
Page_Checkins               int64
Page_Talking_About          int64
Page_Category               int64
Post_Lenght                 int64
Target_Post_Share_Count     int64
Nr_Comments                 int64
Wday_Publishing            object
dtype: object
The data types of the interesting columns are correct. The interesting columns are the ones needed for modelling and continuing the EDA.
To save space in the document I only print the types of the chosen features.
That is where my initial EDA ends; I will be returning to it and improving it as the project moves forward.
The 3rd phase is model training and evaluation.
In the 1st iteration I will be using two algorithms to check which one solves my problem better:
kNN, which can be used for both classification and regression, and linear regression. Perhaps in later iterations I will also use other algorithms, but for now I want to focus on these.
K-Nearest Neighbors
I start with standardizing the values, as it is necessary for these algorithms. This process allows me to compare scores between different types of variables.
from sklearn.preprocessing import StandardScaler
from pandas import DataFrame
# trying out the standardization
fb_normalized = facebook_df[['Page_Popularity','Target_Post_Share_Count','Wday_Nr']]
# creating a working sub dataframe to check correlation on normalized data
names = ['Page_Popularity','Target_Post_Share_Count','Wday_Nr']
# choosing names for normalized columns, same as previous
scaler = StandardScaler()
# defining scaler
fb_normalized = scaler.fit_transform(fb_normalized)
# normalizing the data
fb_normalized = DataFrame(fb_normalized)
# converting normalized data into pandas dataframe
fb_normalized = fb_normalized.set_axis(names, axis=1)
# changing column names
fb_normalized
| Page_Popularity | Target_Post_Share_Count | Wday_Nr | |
|---|---|---|---|
| 0 | -0.100014 | -0.121965 | -0.459813 |
| 1 | -0.100014 | -0.123023 | 0.051075 |
| 2 | -0.100014 | -0.121965 | 0.561964 |
| 3 | -0.100014 | -0.123023 | 0.561964 |
| 4 | -0.100014 | -0.118790 | -1.481590 |
| ... | ... | ... | ... |
| 40936 | 0.862980 | 1.474715 | -0.459813 |
| 40937 | 0.862980 | 1.038776 | -0.459813 |
| 40938 | 0.862980 | 3.010024 | -0.459813 |
| 40939 | 0.862980 | 1.339277 | -0.459813 |
| 40940 | 0.862980 | 3.824764 | -0.459813 |
40941 rows × 3 columns
In this step I choose the features I will need for modelling. These features were already chosen at the beginning of this project and are as follows:
X - popularity of the mother page and the weekday of publishing
Y - the target variable, the value the model will predict: the post share count
These features were chosen during the Domain Understanding because, according to my research, they are dependent on each other.
# Define X_fb and y_fb
X_fb = fb_normalized[['Page_Popularity','Wday_Nr']].values #'Wday_Nr'
y_fb = fb_normalized['Target_Post_Share_Count'].values
print('Wday_Publishing types:', facebook_df['Wday_Nr'].unique())
# Normalize
#scaler_fb = StandardScaler().fit(X_fb)
# StandardScaler standardizes a feature by subtracting the mean and then
# scaling to unit variance. Unit variance means dividing all the values
# by the standard deviation. StandardScaler makes the mean of the distribution
# approximately 0.
#X_fb = scaler_fb.transform(X_fb)
print('The length of X_fb: {}'.format(len(X_fb)))
plt.rcParams["figure.figsize"] = (10, 8)
plt.scatter(X_fb[:,0], y_fb, edgecolors='k', c=facebook_df['Wday_Nr'])
Wday_Publishing types: [3 4 5 1 2 7 6] The length of X_fb: 40941
<matplotlib.collections.PathCollection at 0x7fcf8a433e80>
6.1.3 Dividing data into a training and test set
I am using the train_test_split() function to divide the dataset into training and testing pieces. I choose a test size of 0.2, as the correlation is pretty low
and I want to dedicate more data to training.
# Split into train and test sets
X_train_fb, X_test_fb, y_train_fb, y_test_fb = train_test_split(X_fb, y_fb, test_size=0.2, random_state=0)
print('Train shape:', X_train_fb.shape, y_train_fb.shape)
print('Test shape:', X_test_fb.shape, y_test_fb.shape)
Train shape: (32752, 2) (32752,) Test shape: (8189, 2) (8189,)
I remove the outliers to keep only 'pure' data, with no extreme exceptions from the majority.
# checking the min and max values first
X = fb_normalized[['Page_Popularity']].values
y = fb_normalized['Target_Post_Share_Count'].values
print('-------------BEFORE REMOVING OUTLIERS-------------')
print('Max value of X: '+ str(X.max()))
print('MEAN value of X: '+ str(X.mean()))
print('Min value of X: '+ str(X.min()))
print('STD value of X: '+ str(X.std()))
print('')
print('Max value of y: '+ str(y.max()))
print('MEAN value of y: '+ str(y.mean()))
print('Min value of y: '+ str(y.min()))
print('STD value of y: '+ str(y.std()))
print('')
# I want to manually keep only the values that lie within one standard deviation of the mean
fb_normalized_2 = fb_normalized[fb_normalized['Page_Popularity'] < fb_normalized['Page_Popularity'].mean() + fb_normalized['Page_Popularity'].std()]
fb_normalized_2 = fb_normalized_2[fb_normalized['Page_Popularity'] > fb_normalized['Page_Popularity'].mean() - fb_normalized['Page_Popularity'].std()]
fb_normalized_2 = fb_normalized_2[fb_normalized['Target_Post_Share_Count'] < fb_normalized['Target_Post_Share_Count'].mean() + fb_normalized['Target_Post_Share_Count'].std()]
fb_normalized_2 = fb_normalized_2[fb_normalized['Target_Post_Share_Count'] > fb_normalized['Target_Post_Share_Count'].mean() - fb_normalized['Target_Post_Share_Count'].std()]
X = fb_normalized_2[['Page_Popularity']].values
y = fb_normalized_2['Target_Post_Share_Count'].values
print('')
print('-------------AFTER REMOVING OUTLIERS-------------')
print('Max value of X: '+ str(X.max()))
print('MEAN value of X: '+ str(X.mean()))
print('Min value of X: '+ str(X.min()))
print('STD value of X: '+ str(X.std()))
print('')
print('Max value of y: '+ str(y.max()))
print('MEAN value of y: '+ str(y.mean()))
print('Min value of y: '+ str(y.min()))
print('STD value of y: '+ str(y.std()))
print('')
print('')
print('')
-------------BEFORE REMOVING OUTLIERS-------------
Max value of X: 71.56513567379072
MEAN value of X: -3.1239513552873168e-18
Min value of X: -0.19357952443119697
STD value of X: 1.0000000000000004

Max value of y: 153.1528973226614
MEAN value of y: 2.7768456491442817e-18
Min value of y: -0.12302283032551276
STD value of y: 1.0000000000000002

-------------AFTER REMOVING OUTLIERS-------------
Max value of X: 0.9968781375978378
MEAN value of X: -0.06107551487612045
Min value of X: -0.19357952443119697
STD value of X: 0.21917812739533596

Max value of y: 0.9975095455849459
MEAN value of y: -0.05939195495823606
Min value of y: -0.12302283032551276
STD value of y: 0.14049948576864216
/var/folders/v9/szm7dr5x3j55c958f77yjjk80000gn/T/ipykernel_58316/3306675629.py:19: UserWarning: Boolean Series key will be reindexed to match DataFrame index. /var/folders/v9/szm7dr5x3j55c958f77yjjk80000gn/T/ipykernel_58316/3306675629.py:21: UserWarning: Boolean Series key will be reindexed to match DataFrame index. /var/folders/v9/szm7dr5x3j55c958f77yjjk80000gn/T/ipykernel_58316/3306675629.py:22: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
From the feedback after Iteration 0 I got a hint to try modelling without removing outliers, which is why I compute everything on both the data with outliers and the data without them.
The script above prints the max, mean, min and std of the variables, then removes the outliers and prints the summary once more so the two can be compared.
6.1.5 Selecting features with no outliers
# Define X_fb_o and y_fb_o
X_fb_o = fb_normalized_2[['Page_Popularity','Wday_Nr']].values
y_fb_o = fb_normalized_2['Target_Post_Share_Count'].values
print('Wday_Publishing types:', fb_normalized_2['Wday_Nr'].unique())
print('The length of X_fb_o: {}'.format(len(X_fb_o)))
plt.rcParams["figure.figsize"] = (10, 8)
plt.scatter(X_fb_o[:,0], y_fb_o, edgecolors='k', c=fb_normalized_2['Wday_Nr'])
Wday_Publishing types: [-0.4598134 0.05107513 0.56196365 -1.48159045 -0.97070193 1.58374071 1.07285218] The length of X_fb_o: 39501
(Scatter plot of Page_Popularity against Target_Post_Share_Count, coloured by Wday_Nr.)
The plot looks much better now: the values are scaled and there are no outliers.
6.1.6 Dividing data into train and test set with no outliers
To easily toggle between feature sets I simply add an "_o" suffix to the variable names.
# Split in train and test sets
X_train_fb_o, X_test_fb_o, y_train_fb_o, y_test_fb_o = train_test_split(X_fb_o, y_fb_o, test_size=0.2, random_state=0)
print('Train shape:', X_train_fb_o.shape, y_train_fb_o.shape)
print('Test shape:', X_test_fb_o.shape, y_test_fb_o.shape)
plt.rcParams["figure.figsize"] = (10, 8)
plt.scatter(X_train_fb_o[:,0], y_train_fb_o, edgecolors='k', c=y_train_fb_o)
Train shape: (31600, 2) (31600,) Test shape: (7901, 2) (7901,)
(Scatter plot of the training data without outliers.)
I start modelling with linear regression. To have a clear comparison I will create two models - with and without outliers - and print both results.
from sklearn.linear_model import LinearRegression
slr = LinearRegression()
slr.fit(X_train_fb, y_train_fb)
y_train_pred = slr.predict(X_train_fb)
y_test_pred = slr.predict(X_test_fb)
print('-------------------WITH OUTLIERS-------------------')
print('Slope: %.3f' % slr.coef_[0])
print('Intercept: %.3f' % slr.intercept_)
print('')
slr.fit(X_train_fb_o, y_train_fb_o)
y_train_pred_o = slr.predict(X_train_fb_o)
y_test_pred_o = slr.predict(X_test_fb_o)
print('-------------------WITHOUT OUTLIERS-------------------')
print('Slope_o: %.3f' % slr.coef_[0])
print('Intercept_o: %.3f' % slr.intercept_)
-------------------WITH OUTLIERS------------------- Slope: 0.217 Intercept: -0.006 -------------------WITHOUT OUTLIERS------------------- Slope_o: 0.206 Intercept_o: -0.047
Both models are similar in slope and intercept.
For kNN I will be using KNeighborsRegressor(), which is the regression equivalent of KNeighborsClassifier.
To find the K value that results in the lowest RMSE, I will check every option in the range 1-20, both for the 'initial' features and for the 'no outliers' features.
from sklearn import neighbors
from sklearn.metrics import mean_squared_error
from math import sqrt
print('-------------------WITH OUTLIERS-------------------')
rmse_val2 = [] # store rmse values for different k
for K in range(20):
K = K+1
model = neighbors.KNeighborsRegressor(n_neighbors = K)
model.fit(X_train_fb, y_train_fb) #fit the model
pred=model.predict(X_test_fb) #make prediction on test set
error = sqrt(mean_squared_error(y_test_fb,pred)) #calculate rmse
rmse_val2.append(error) #store rmse values
print('RMSE value for k =' , K , 'is:', error)
print('')
print('')
print('-------------------WITHOUT OUTLIERS-------------------')
rmse_val = [] # store rmse values for different k
for K in range(20):
K = K+1
model = neighbors.KNeighborsRegressor(n_neighbors = K)
model.fit(X_train_fb_o, y_train_fb_o) #fit the model
pred=model.predict(X_test_fb_o) #make prediction on test set
error = sqrt(mean_squared_error(y_test_fb_o,pred)) #calculate rmse
rmse_val.append(error) #store rmse values
print('RMSE value for k =' , K , 'is:', error)
-------------------WITH OUTLIERS-------------------
RMSE value for k = 1 is: 1.9716500158911145
RMSE value for k = 2 is: 1.9196281175641887
RMSE value for k = 3 is: 1.9147326465599128
RMSE value for k = 4 is: 1.9230238146086547
RMSE value for k = 5 is: 1.9292926159806763
RMSE value for k = 6 is: 1.935581693197842
RMSE value for k = 7 is: 1.9389239946257084
RMSE value for k = 8 is: 1.942644946411843
RMSE value for k = 9 is: 1.945269853134075
RMSE value for k = 10 is: 1.947763400568935
RMSE value for k = 11 is: 1.9472707563707559
RMSE value for k = 12 is: 1.949485451016581
RMSE value for k = 13 is: 1.953782677002893
RMSE value for k = 14 is: 1.955150466053564
RMSE value for k = 15 is: 1.9558959342735387
RMSE value for k = 16 is: 1.9571065934359255
RMSE value for k = 17 is: 1.9587145172313563
RMSE value for k = 18 is: 1.9597961195244162
RMSE value for k = 19 is: 1.96094663635039
RMSE value for k = 20 is: 1.9606627496591325

-------------------WITHOUT OUTLIERS-------------------
RMSE value for k = 1 is: 0.1591643455763324
RMSE value for k = 2 is: 0.13766853505829216
RMSE value for k = 3 is: 0.12990316188878837
RMSE value for k = 4 is: 0.1282561624001337
RMSE value for k = 5 is: 0.125197919933395
RMSE value for k = 6 is: 0.12407129860050344
RMSE value for k = 7 is: 0.12317276523318878
RMSE value for k = 8 is: 0.12289352625992161
RMSE value for k = 9 is: 0.12261050341799243
RMSE value for k = 10 is: 0.12217409066825947
RMSE value for k = 11 is: 0.12188825094448755
RMSE value for k = 12 is: 0.12235733914745127
RMSE value for k = 13 is: 0.12192341220663916
RMSE value for k = 14 is: 0.12218101755081948
RMSE value for k = 15 is: 0.12180433360424067
RMSE value for k = 16 is: 0.12188114322715268
RMSE value for k = 17 is: 0.12182695722586119
RMSE value for k = 18 is: 0.12193763379661028
RMSE value for k = 19 is: 0.12212649262968502
RMSE value for k = 20 is: 0.12250264000236873
Root Mean Square Error (RMSE) is a standard way to measure the error of a model in predicting quantitative data.
I believe the best K value is somewhere around 10. This time the no-outliers data results in a significantly better RMSE. I think this will be clearly visible in the plots.
# plotting the RMSE values against K values
curve = pd.DataFrame(rmse_val2)  # with outliers
curve.plot()
curve = pd.DataFrame(rmse_val)   # without outliers (elbow curve)
curve.plot()
(Two line plots of RMSE against K: with outliers on top, without outliers below.)
Both plots illustrate the same thing - the influence of the K value on the RMSE. While the upper plot has no clear pattern, the bottom one is easily readable: the RMSE decreases as K increases, with a local minimum around 7 - 10. To find the best K value I can use the GridSearchCV feature from sklearn.
from sklearn.model_selection import GridSearchCV
params = {'n_neighbors':[2,3,4,5,6,7,8,9,10,11,12,13,14]} # iterating the possible options read from plot
knn = neighbors.KNeighborsRegressor()
model = GridSearchCV(knn, params, cv=5) # searching for best value
model.fit(X_train_fb_o,y_train_fb_o)
model.best_params_ # printing best K value
{'n_neighbors': 10}
The code above takes some time to run, as it has to iterate over all the values I specify; the more values, the longer it takes. Finally I can create the model with the best K value now known.
model = neighbors.KNeighborsRegressor(n_neighbors = 10)
model.fit(X_train_fb_o, y_train_fb_o) #fit the model
y_pred_fb_o = model.predict(X_test_fb_o) #make prediction on test set
While creating the model I set n_neighbors of KNeighborsRegressor() to 10.
Here are some things to keep in mind:
As we decrease the value of K to 1, our predictions become less stable.
Inversely, as we increase the value of K, our predictions become more stable due to majority voting / averaging,
and thus more likely to be accurate (up to a certain point).
Eventually, we begin to witness an increasing number of errors; at that point we know we have pushed the value of K too far.
In cases where we are taking a majority vote (e.g. picking the mode in a classification problem) among labels, we usually make K an odd number to have a tiebreaker.
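To double-check that K = 10 is a stable choice and not an artefact of this particular train/test split, here is a quick sketch (assuming the variables defined above and a sklearn version that provides the 'neg_root_mean_squared_error' scorer) using cross-validated RMSE:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated RMSE for the chosen K on the no-outliers training data
knn_10 = neighbors.KNeighborsRegressor(n_neighbors=10)
cv_rmse = -cross_val_score(knn_10, X_train_fb_o, y_train_fb_o, cv=5, scoring='neg_root_mean_squared_error')
print('CV RMSE per fold:', np.round(cv_rmse, 4))
print('Mean CV RMSE:', cv_rmse.mean())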
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
print('-------------------WITH OUTLIERS-------------------')
print('MSE train: %.3f, test: %.3f' % (
mean_squared_error(y_train_fb, y_train_pred),
mean_squared_error(y_test_fb, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' %
(r2_score(y_train_fb, y_train_pred),
r2_score(y_test_fb, y_test_pred)))
print('MAE train: %.3f, test: %.3f' %
(mean_absolute_error(y_train_fb, y_train_pred),
mean_absolute_error(y_test_fb, y_test_pred)))
print('RMSE train: %.3f, test: %.3f' %
(np.sqrt(mean_squared_error(y_train_fb, y_train_pred)),
np.sqrt(mean_squared_error(y_test_fb, y_test_pred))))
print('')
print('-------------------WITHOUT OUTLIERS-------------------')
print('MSE_o train: %.3f, test: %.3f' % (
mean_squared_error(y_train_fb_o, y_train_pred_o),
mean_squared_error(y_test_fb_o, y_test_pred_o)))
print('R^2_o train: %.3f, test: %.3f' %
(r2_score(y_train_fb_o, y_train_pred_o),
r2_score(y_test_fb_o, y_test_pred_o)))
print('MAE_o train: %.3f, test: %.3f' %
(mean_absolute_error(y_train_fb_o, y_train_pred_o),
mean_absolute_error(y_test_fb_o, y_test_pred_o)))
print('RMSE_o train: %.3f, test: %.3f' %
(np.sqrt(mean_squared_error(y_train_fb_o, y_train_pred_o)),
np.sqrt(mean_squared_error(y_test_fb_o, y_test_pred_o))))
-------------------WITH OUTLIERS-------------------
MSE train: 0.222, test: 3.628
R^2 train: 0.157, test: 0.080
MAE train: 0.148, test: 0.178
RMSE train: 0.471, test: 1.905

-------------------WITHOUT OUTLIERS-------------------
MSE_o train: 0.018, test: 0.017
R^2_o train: 0.102, test: 0.109
MAE_o train: 0.072, test: 0.072
RMSE_o train: 0.133, test: 0.132
MSE - Mean Squared Error is one of the most used and simplest metrics: it is the mean of the squared differences between the actual and predicted values. The lower it gets, the better the model. 1 - 0 for no outliers.
MAE - Mean Absolute Error is a very simple metric that calculates the mean absolute difference between actual and predicted values. Again, the lower the better. 2 - 0 for no outliers.
RMSE - Root Mean Squared Error is, as the name suggests, simply the square root of the mean squared error. Its value is in the same unit as the target variable, which makes the loss easy to interpret. 3 - 0 for no outliers.
R^2 - R Squared tells how well the model performs rather than measuring the loss in an absolute sense. It compares the model against a baseline that always predicts the mean, which none of the other metrics provides; in other words, it measures how much better the regression line fits the data than a simple mean line. This time a draw, 4 - 1 still for no outliers.
Overall the model created from features with no outliers performs much better.
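As a quick sanity check of the R^2 interpretation above, the score can be reproduced by hand: it compares the model's squared errors against those of a baseline that always predicts the mean. A minimal sketch, assuming the no-outliers test arrays defined above:
# R^2 by hand: 1 - SS_res / SS_tot, where SS_tot is the squared error of a mean-only baseline
ss_res = np.sum((y_test_fb_o - y_test_pred_o) ** 2)
ss_tot = np.sum((y_test_fb_o - y_test_fb_o.mean()) ** 2)
print('Manual R^2 :', 1 - ss_res / ss_tot)   # should match r2_score above
# RMSE by hand: square root of the mean squared error
print('Manual RMSE:', np.sqrt(np.mean((y_test_fb_o - y_test_pred_o) ** 2)))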
error = sqrt(mean_squared_error(y_test_fb_o, y_pred_fb_o)) #calculate rmse
score = model.score(X_train_fb_o, y_train_fb_o)
print('Number of test points: ',X_test_fb_o.size)
print('RMSE value :', error)
print('Model score :', score)
Number of test points: 15802 RMSE value : 0.12217409066825947 Model score : 0.33645789322205377
The score of 33% is significantly better than the one from Iteration 0. I did spend much more time on preprocessing and modelling, the results of which you can see above.
Below I will comment on my previous evaluation adding points for this (improved) one.
The accuracy score of 0.19% is actually disappointing. In the next iteration I will investigate its cause and apply solutions.
I did not expect 100% accuracy; however, with this correlation I did expect somewhat more, and a model with this accuracy is rather useless.
It might be caused by a mistake of mine somewhere in the process or by not choosing the best algorithm. Anyway, I will improve this project in further iterations and hopefully the accuracy score will improve with it.
Possible reasons for low accuracy I see so far:
The second iteration brings a positive outcome to this project, with the model score increased to 33%, which indicates an improvement to the model. It confirms my initial thought that the goal of this project is not the accuracy itself but the whole process of walking through the methodology.
The algorithm I originally used in Iteration 0 was indeed suitable for the problem, however it was used in the wrong 'setting' of classification instead of regression, resulting in the complete opposite output. Lesson learned, algorithm fixed.
Iteration 1 also brings a new discovery: the data with outliers removed suits the model better and results in a higher score. This had to be checked, and that is why Phases 2 & 3 are divided into 'Outliers' and 'No Outliers' sections. The mistake in feature selection, which my teacher pointed out, is now repaired and no longer causes errors in the chunks below.
Experiments were also part of this submission. With the MinMaxScaler from the sklearn package I tried to normalize the values for the model. Unfortunately, I got a bit lost in the documentation, did not want to delay the submission and gave up on this idea for now.
To choose the best hyperparameters for the algorithm I used a script that does it for me and ensures the best results.
That being summarized, I consider this iteration a major improvement to the initial project and am looking forward to the next ones.
Iteration 0
Iteration 1
Go back to Table of contents.
This section includes all feedback I received on this project. The idea is to make it transparent and easily accessible.
Feedback Iteration 0 + Addressing it in Iteration 1
- No need to explain theory - Indeed, I removed most of theory explanation to save space in the doc.
- Document is highly verbose - I tried to limit some parts by introducing toggle buttons and made moving through doc easier by adding links to sections.
- You removed the Target Variable - I wanted to experiment but didn't mention it; now it's fixed.
- Maybe this is a regression problem - Haha Indeed, I fixed kNN to 'regression' settings.
- Maybe you removed too many outliers - I did check it and no outliers set gives better score.
- Maybe you selected the wrong features - Perhaps; I stick to my initial plan for now, but I want to talk it over with you and see my options.
- address my feedback - I sure did :).
- really nice that you made such an extensive data analysis - I try.
- results that are actually carrying valuable information - Indeed, I believe you will like my new plots which actually do so.
- try to be more focused towards one goal - I think I understand and tried to keep my EDA on point.
- References, from the meeting - fixed now.
- Who is your (fictional) client? - I added this section.
- What would be the added value - Also added in this Iteration.
- Way to interview an expert on this subject - It is, I am scheduling it right now and have already created a plan (in this doc).
- consider the additional impact your project could have on people or society - Added in this iteration.
- there are quite a few typos - Yeah, speed writing in markdown... I equipped myself with spell-check software - any blame goes to it from now on :)
Feedback Iteration 1
- Data analytics & Investigative analysis
Go back to Table of contents.
Shanika Wickramasinghe (2021). Bias & Variance in Machine Learning: Concepts & Tutorials. Retrieved 03:15, March 6, 2022, from
https://www.bmc.com/blogs/bias-variance-machine-learning/#:~:text=What%20is%20bias%20in%20machine,assumptions%20in%20the%20ML%20process
Wikipedia contributors (2022, February 24). K-nearest neighbors algorithm. In Wikipedia, The Free Encyclopedia. Retrieved 09:06, March 14, 2022, from
https://en.wikipedia.org/w/index.php?title=K-nearest_neighbors_algorithm&oldid=1073696056
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].
Irvine, CA: University of California, School of Information and Computer Science.
Moro, S., et al. (2016). Predicting social media performance metrics and evaluation of the impact on brand building:
a data mining approach. Journal of Business Research. http://dx.doi.org/10.1016/j.jbusres.2016.02.010
Zhao, Y. and Zhang, Y. (2008). Comparison of decision tree methods for finding active objects. Advances in Space Research 41 (12), 1955-1959.
Kamaljot Singh & Ranjeet Kaur (2015). Comment Volume Prediction using Neural Networks and Decision Trees.
Retrieved 11:32, March 3, 2022, from
https://www.researchgate.net/profile/Kamaljot-Singh-2/publication/301284745_Comment_Volume_Prediction_using_Neural_Networks_and_Decision_Trees/links/570f3ce808aecd31ec9a95bf/Comment-Volume-Prediction-using-Neural-Networks-and-Decision-Trees.pdf
Onel Harrison (Sep 10, 2018). Machine Learning Basics with the K-Nearest Neighbors Algorithm.
Retrieved March 14, 2022, from
https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
Aishwarya Singh. A Practical Introduction to K-Nearest Neighbors Algorithm for Regression.
Retrieved March 30, 2022, from
https://www.analyticsvidhya.com/blog/2018/08/k-nearest-neighbor-introduction-regression-python/
Raghav Agrawal (May 19, 2021). Know The Best Evaluation Metrics for Your Regression Model!
Retrieved March 30, 2022, from
https://www.analyticsvidhya.com/blog/2021/05/know-the-best-evaluation-metrics-for-your-regression-model/
Bonestroo, W.J., Meesters, M., Niels, R., Schagen, J.D., Henneke, L., Turnhout, K. van (2018). ICT Research Methods. HBO-i, Amsterdam. ISBN/EAN: 9990002067426.
Available from: http://www.ictresearchmethods.nl/
Number of monthly active Facebook users worldwide as of 4th quarter 2021. Statista.com.
Retrieved March 31, 2022, from
https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/
Visobe & Mohamud (2019, July 12). Why it's important to standardize your data. Humans of Data, Atlan.
Retrieved April 12, 2022, from https://humansofdata.atlan.com/2018/12/data-standardization/
Go back to Table of contents.
Iteration 2
In this Iteration I want to focus on extending Phase 3 by introducing a new algorithm and trying out different features. Additionally, I will create a first version of the delivery Phase 4, which will later be improved in the final Iteration.
The new algorithm I want to try out is the Support Vector Machine. Its original purpose was classification, however there is also a regression variant (SVR).
This time I am going to dive into the documentation much more than I did for the first-Iteration kNN, which ended up as a classification model predicting a continuous variable.
Looking at the heat map from previous iterations, I noticed a relatively high (for this data) correlation between 'Page_Popularity' and 'Page_Talking_About'.
My target variable stays the same for the whole project - 'Target_Post_Share_Count' - and I want to switch from 'Wday_Nr' to 'Page_Talking_About'.
Additionally, SVMs, despite being complicated algorithms, offer challenging visualization opportunities which I plan to take up.
This Iteration is intended to have a different structure from the previous one (no more rollercoaster) and will be based on only one set of features, with no outliers, as that kind of set has already proven to give better model performance.
As I will be using some new libraries, I load them here (in Iteration 2) to have easy access and avoid scrolling through the whole document.
# Data Manipulation
import pandas as pd # for data manipulation
import numpy as np # for data manipulation
# Sklearn
from sklearn.linear_model import LinearRegression # for building a linear regression model
from sklearn.svm import SVR # for building SVR model
from sklearn.preprocessing import MinMaxScaler
# Visualizations
import plotly.graph_objects as go # for data visualization
import plotly.express as px # for data visualization
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
df = facebook_df[['Page_Popularity','Page_Talking_About','Target_Post_Share_Count']]
#df = df.drop_duplicates()
In the next step I remove the outliers to keep only data points that do not deviate too far from the majority. I do it manually, as I already created a working script for it. However, I am aware of ready-made functions that could do it more easily; perhaps I will use them in the future.
# checking the min and max values first
X = df[['Page_Popularity']].values
y = df['Target_Post_Share_Count'].values
z = df['Page_Talking_About'].values
print('-------------BEFORE REMOVING OUTLIERS-------------')
print('Max value of X: '+ str(X.max()))
print('MEAN value of X: '+ str(X.mean()))
print('Min value of X: '+ str(X.min()))
print('STD value of X: '+ str(X.std()))
print('')
print('Max value of y: '+ str(y.max()))
print('MEAN value of y: '+ str(y.mean()))
print('Min value of y: '+ str(y.min()))
print('STD value of y: '+ str(y.std()))
print('')
print('Max value of z: '+ str(z.max()))
print('MEAN value of z: '+ str(z.mean()))
print('Min value of z: '+ str(z.min()))
print('STD value of z: '+ str(z.std()))
print('')
# Manually keep only rows that lie within one standard deviation of the column mean (for all three columns)
df = df[df['Page_Popularity'] < df['Page_Popularity'].mean() + df['Page_Popularity'].std()]
df = df[df['Page_Popularity'] > df['Page_Popularity'].mean() - df['Page_Popularity'].std()]
df = df[df['Target_Post_Share_Count'] < df['Target_Post_Share_Count'].mean() + df['Target_Post_Share_Count'].std()]
df = df[df['Target_Post_Share_Count'] > df['Target_Post_Share_Count'].mean() - df['Target_Post_Share_Count'].std()]
df = df[df['Page_Talking_About'] < df['Page_Talking_About'].mean() + df['Page_Talking_About'].std()]
df = df[df['Page_Talking_About'] > df['Page_Talking_About'].mean() - df['Page_Talking_About'].std()]
X = df[['Page_Popularity']].values
y = df['Target_Post_Share_Count'].values
z = df['Page_Talking_About'].values
# checking the min and max values after removing outliers
print('')
print('-------------AFTER REMOVING OUTLIERS-------------')
print('Max value of X: '+ str(X.max()))
print('MEAN value of X: '+ str(X.mean()))
print('Min value of X: '+ str(X.min()))
print('STD value of X: '+ str(X.std()))
print('')
print('Max value of y: '+ str(y.max()))
print('MEAN value of y: '+ str(y.mean()))
print('Min value of y: '+ str(y.min()))
print('STD value of y: '+ str(y.std()))
print('')
print('Max value of z: '+ str(z.max()))
print('MEAN value of z: '+ str(z.mean()))
print('Min value of z: '+ str(z.min()))
print('STD value of z: '+ str(z.std()))
print('')
-------------BEFORE REMOVING OUTLIERS-------------
Max value of X: 486972297
MEAN value of X: 1313813.7475396225
Min value of X: 36
STD value of X: 6785668.891206464

Max value of y: 6089942
MEAN value of y: 117.24982295049941
Min value of y: 1
STD value of y: 944.9951281551448

Max value of z: 6089942
MEAN value of z: 44800.2517033383
Min value of z: 0
STD value of z: 110932.44302159699

-------------AFTER REMOVING OUTLIERS-------------
Max value of X: 7373665
MEAN value of X: 567241.5275597319
Min value of X: 36
STD value of X: 989462.8213423738

Max value of y: 541
MEAN value of y: 38.546095227181574
Min value of y: 1
STD value of y: 75.93529241294107

Max value of z: 104226
MEAN value of z: 18226.58477052656
Min value of z: 0
STD value of z: 26393.53724643607
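As mentioned above, ready-made helpers exist for this kind of filtering. Here is a sketch (not run here) of a roughly equivalent one-shot filter using scipy.stats.zscore, applied to a hypothetical df_raw holding the three columns before any filtering; note that my manual version re-computes the statistics after each step, so the result will not be exactly identical.
# Hypothetical: df_raw is a copy of the three columns taken before any filtering
from scipy import stats
cols = ['Page_Popularity', 'Page_Talking_About', 'Target_Post_Share_Count']
z_scores = np.abs(stats.zscore(df_raw[cols]))      # distance from the mean in units of std
df_filtered = df_raw[(z_scores < 1).all(axis=1)]   # keep rows within 1 std on every column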
In the previous iteration I used standardization to scale the features. Now I want to try the most commonly used technique in the ML industry: Min-Max normalization.
There are two main reasons that support the need for scaling:
scaler = MinMaxScaler() # initiating scaler
df['Page_Popularity(scaled)']=scaler.fit_transform(df[['Page_Popularity']])
df['Page_Talking_About(scaled)']=scaler.fit_transform(df[['Page_Talking_About']])
df['Target_Post_Share_Count(scaled)']=scaler.fit_transform(df[['Target_Post_Share_Count']])
# Print Dataframe
df
| feature | Page_Popularity | Page_Talking_About | Target_Post_Share_Count | Page_Popularity(scaled) | Page_Talking_About(scaled) | Target_Post_Share_Count(scaled) |
|---|---|---|---|---|---|---|
| 0 | 634995 | 463 | 2 | 0.086112 | 0.004442 | 0.001852 |
| 1 | 634995 | 463 | 1 | 0.086112 | 0.004442 | 0.000000 |
| 2 | 634995 | 463 | 2 | 0.086112 | 0.004442 | 0.001852 |
| 3 | 634995 | 463 | 1 | 0.086112 | 0.004442 | 0.000000 |
| 4 | 634995 | 463 | 5 | 0.086112 | 0.004442 | 0.007407 |
| ... | ... | ... | ... | ... | ... | ... |
| 40919 | 309914 | 5432 | 15 | 0.042025 | 0.052118 | 0.025926 |
| 40920 | 309914 | 5432 | 89 | 0.042025 | 0.052118 | 0.162963 |
| 40921 | 309914 | 5432 | 77 | 0.042025 | 0.052118 | 0.140741 |
| 40922 | 309914 | 5432 | 52 | 0.042025 | 0.052118 | 0.094444 |
| 40923 | 309914 | 5432 | 107 | 0.042025 | 0.052118 | 0.196296 |
34906 rows × 6 columns
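The scaling itself is simple enough to verify by hand; a small sketch checking that MinMaxScaler did what I expect for one column:
# Manual Min-Max scaling: (x - min) / (max - min); should match the scaler output
col = df['Page_Popularity']
manual = (col - col.min()) / (col.max() - col.min())
print('Matches MinMaxScaler output:', np.allclose(manual.values, df['Page_Popularity(scaled)'].values))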
In this step I select the features for my model. They are already chosen and explained in previous points. Additionally, I plot two features to visualize their correlation.
X = df[['Page_Popularity(scaled)','Page_Talking_About(scaled)']]
y = df['Target_Post_Share_Count']
print('The length of X: {}'.format(len(X)))
print('The length of y: {}'.format(len(y)))
# Create a scatter plot
fig = px.scatter(df, x=df['Page_Talking_About(scaled)'], y=df['Page_Popularity(scaled)'],
opacity=0.8, color_discrete_sequence=['black'])
# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'))
# Update axes lines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey',
zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey',
showline=True, linewidth=1, linecolor='black')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey',
zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey',
showline=True, linewidth=1, linecolor='black')
# Update marker size
fig.update_traces(marker=dict(size=10))
fig.show()
The length of X: 34906 The length of y: 34906
It looks like most values are concentrated in the 0-0.2 area and slowly spread out as x and y increase.
However, the axis scales are misleading here; the plot area should really be square.
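If I wanted the plot area to be square (a sketch, not applied above), plotly can anchor the y axis scale to the x axis:
# Force a 1:1 aspect ratio so equal distances on both axes look equal
fig.update_yaxes(scaleanchor='x', scaleratio=1)
fig.show()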
6.1.4 Dividing data into a training and test set
I use the train_test_split() function to divide the dataset into training and testing parts. This time I choose a test size of 0.3: the correlation is still pretty low, so I keep 70% of the data for training and leave a somewhat larger share for testing.
# Split in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print('Train shape:', X_train.shape, y_train.shape)
print('Test shape:', X_test.shape, y_test.shape)
Train shape: (24434, 2) (24434,) Test shape: (10472, 2) (10472,)
from sklearn import svm
from sklearn.model_selection import GridSearchCV
parameters = {'kernel': ['rbf'], 'C':[50,100,150,200,250,300,400],'epsilon':[1,2,3,4,5,6,7,8,9,10]}
svr = svm.SVR()
clf = GridSearchCV(svr, parameters)
clf.fit(X_train, y_train)
clf.best_params_
The chunk above took over 6 hours to compute and resulted in C=200 and epsilon=5.
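Six hours is a lot. A possible speed-up (a sketch I have not run) would be to sample the grid randomly instead of exhaustively and to parallelise across CPU cores with n_jobs=-1:
from sklearn.model_selection import RandomizedSearchCV
# Sample 20 random combinations from the same search space, using all CPU cores and fewer CV folds
param_dist = {'C': [50, 100, 150, 200, 250, 300, 400], 'epsilon': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
search = RandomizedSearchCV(svm.SVR(kernel='rbf'), param_dist, n_iter=20, cv=3, n_jobs=-1, random_state=0)
# search.fit(X_train, y_train)
# search.best_params_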
# Define models and set hyperparameter values
model1 = LinearRegression()
model2 = SVR(kernel='rbf', C=200, epsilon=5)
# Fit the two models
model1.fit(X_train, y_train)
model2.fit(X_train, y_train)
# create prediction
y_pred = model2.predict(X_test)
kernel='rbf'
kernel{‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’} or callable, default=’rbf’
RBF kernels are the most generalized form of kernelization and one of the most widely used kernels due to their similarity to the Gaussian distribution. The RBF kernel function for two points X₁ and X₂ computes their similarity, i.e. how close they are to each other.
"RBF Kernel is popular because of its similarity to K-Nearest Neighborhood Algorithm. It has the advantages of K-NN and overcomes the space complexity problem as RBF Kernel Support Vector Machines just needs to store the support vectors during training and not the entire dataset."
C=200
The C parameter tells the SVM optimization how much I want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. For very tiny values of C, I should get misclassified examples, often even if my training data is linearly separable. (In the regression setting, C analogously controls how heavily errors larger than epsilon are penalized.)
epsilon=5
From documentation: "Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value."
To be honest, this hyperparameter is still a bit vague to me. However, I know one thing for sure: the larger ϵ is, the larger the errors I admit in my solution, and for now that is enough.
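To make the effect of epsilon a bit less vague (at least to myself), here is a small sketch (assuming the train/test split above; it refits the SVR a few times, so it takes a while) that compares a few epsilon values:
# Effect of epsilon on the epsilon-SVR: a wider tube tolerates larger errors and needs fewer support vectors
for eps in [1, 5, 10]:
    svr_eps = SVR(kernel='rbf', C=200, epsilon=eps)
    svr_eps.fit(X_train, y_train)
    rmse_eps = np.sqrt(mean_squared_error(y_test, svr_eps.predict(X_test)))
    print('epsilon =', eps, '| test RMSE =', round(rmse_eps, 3), '| support vectors =', len(svr_eps.support_))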
6.2.1 Visualization - Linear Regression Prediction Surface
I take up the challenge of 3D visualization. While there should be no problem with linear regression, the SVM surface will be much more advanced.
# ----------- For creating a prediction plane to be used in the visualization -----------
# Set Increments between points in a meshgrid
mesh_size = 0.05
# Identify min and max values for input variables
x_min, x_max = X['Page_Popularity(scaled)'].min(), X['Page_Popularity(scaled)'].max()
y_min, y_max = X['Page_Talking_About(scaled)'].min(), X['Page_Talking_About(scaled)'].max()
# Return evenly spaced values based on a range between min and max
xrange = np.arange(x_min, x_max, mesh_size)
yrange = np.arange(y_min, y_max, mesh_size)
# Create a meshgrid
xx, yy = np.meshgrid(xrange, yrange)
# Use models to create a prediction plane --- Linear Regression
pred_LR = model1.predict(np.c_[xx.ravel(), yy.ravel()])
pred_LR = pred_LR.reshape(xx.shape)
# Use models to create a prediction plane --- SVR
pred_svr = model2.predict(np.c_[xx.ravel(), yy.ravel()])
pred_svr = pred_svr.reshape(xx.shape)
After a decent amount of time I managed to find tutorials on how to create this 'layer' of the model, which is apparently called a prediction plane. As it turns out, creating the 3D scatter plot is super easy; creating the plane is super complex.
# Visualizations
fig = px.scatter_3d(df, x=df['Page_Popularity(scaled)'], y=df['Page_Talking_About(scaled)'], z=df['Target_Post_Share_Count'],
opacity=0.8, color_discrete_sequence=['black'],
width=1000, height=900
)
# Set figure title and colors
fig.update_layout(title_text="Scatter 3D Plot with Linear Regression Prediction Surface",
scene_camera_eye=dict(x=1.5, y=1.5, z=0.25),
scene_camera_center=dict(x=0, y=0, z=-0.2),
scene = dict(xaxis=dict(backgroundcolor='white',
color='black',
gridcolor='lightgrey'),
yaxis=dict(backgroundcolor='white',
color='black',
gridcolor='lightgrey'
),
zaxis=dict(backgroundcolor='white',
color='black',
gridcolor='lightgrey')))
# Update marker size
fig.update_traces(marker=dict(size=2))
# Add prediction plane
fig.add_traces(go.Surface(x=xrange, y=yrange, z=pred_LR, name='LR',
colorscale=px.colors.sequential.Plotly3, showscale=True))
iplot(fig)
6.2.2 Visualization - SVR Prediction Surface
fig = px.scatter_3d(df, x=df['Page_Popularity(scaled)'], y=df['Page_Talking_About(scaled)'], z=df['Target_Post_Share_Count'],
opacity=0.8, color_discrete_sequence=['black'],
width=1000, height=900
)
# Set figure title and colors
fig.update_layout(title_text="Scatter 3D Plot with SVR Prediction Surface",
scene_camera_eye=dict(x=1.5, y=1.5, z=0.3),
scene_camera_center=dict(x=0, y=0, z=-0.2),
scene = dict(xaxis=dict(backgroundcolor='white',
color='black',
gridcolor='lightgrey'),
yaxis=dict(backgroundcolor='white',
color='black',
gridcolor='lightgrey'
),
zaxis=dict(backgroundcolor='white',
color='black',
gridcolor='lightgrey')))
# Update marker size
fig.update_traces(marker=dict(size=2))
# Add prediction plane
fig.add_traces(go.Surface(x=xrange, y=yrange, z=pred_svr, name='SVR',
colorscale=px.colors.sequential.Plotly3,
showscale=True))
#fig.show()
iplot(fig)
WOW, that looks sophisticated and professional.
I think those two plots visualize nicely why an SVR model can be more flexible than plain linear regression.
Multiple linear regression is a flat plane that tries to fit all data points with a single cut. SVR is more flexible: it can bend and fold to follow the data more closely, which should enable a more accurate model.
from sklearn.metrics import mean_squared_error
from math import sqrt
error = sqrt(mean_squared_error(y_test, y_pred)) #calculate rmse
score = model2.score(X_train, y_train)
print('Number of test points: ',X_test.size)
print('RMSE value :', error)
print('Model score :', score)
Number of test points: 20944 RMSE value : 72.43599035886798 Model score : 0.052912418901942204
Well, the RMSE is higher than I expected and the score is lower than I expected.
For now, this is the worst performing model in this project; the kNN did a much better job. After playing with the hyperparameters I noticed that they strongly influence these numbers (FYI: I obviously knew that before) and changing them improves the score. The grid I searched above was fairly limited, so there may be better combinations to be found; I will have to research it.
At this moment I have no other idea how to improve this model apart from hyperparameter tuning. I am still an SVM geek, and this was the first time I used this algorithm. I will seek improvements after receiving feedback from an ML expert.
Go back to Table of contents.
Delivery is the last phase of every AI project and focuses on the deployment of its solution and on reporting. The key element is to put my model to the test by demonstrating it to my stakeholder. The delivery phase is completed when the feedback from my stakeholders is incorporated into a final submission.
I will start with model selection, choosing 1 out of the 3 models I have created during this project. Based on the evaluation of each one, the best algorithm will be chosen for deployment.
The next step is to create a fully working AI prototype which takes a user input and outputs the predicted value.
Once the application is created, it will be sent for field testing to the project stakeholder, who will provide me with feedback and ideas for improvements.
After incorporating those into the prototype, I can move to the next step - Collecting & Documenting - in which I will gather the most important information about the application.
The last milestone is to present the final product to the stakeholder again and provide him with a project report.
After that, the project is marked as complete.
During this project I have produced 3 different Machine Learning models aimed at predicting a Facebook post's share volume based on selected features. Experimenting with different algorithms gives me the opportunity to choose the best performing one and implement it in a prototype.
List of available models:
As all models were created using regression algorithms and predict a continuous variable, model selection based on an accuracy percentage is not possible, as that applies only to classification problems. That is why the selection will be performed on the following scores:
Every model was trained and tested on the same dataset with the same train/test split. Additionally, a hyperparameter search was performed where applicable, ensuring suitable hyperparameter values for each model's performance.
The best performing model with both the lowest RMSE and the highest Score, outperforming other models in numbers, is:
k-Nearest Neighbors (Regressor)
which is going to be deployed and put into a field testing.
In this part I create a fully working AI prototype in the form of a web application. The product has the following requirements:
import pickle
# save the model to disk
filename = 'kNN_model.sav'
pickle.dump(model, open(filename, 'wb'))
# load the model from disk
loaded_model = pickle.load(open('kNN_model.sav', 'rb'))
# print model accuracy
result = loaded_model.score(X_train_fb_o, y_train_fb_o)
print('MODEL SCORE: ',round(result*100,2),'%')
MODEL SCORE: 33.65 %
print(loaded_model.predict([[1000,3]])*10000)
[3789.41814747]
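The web application itself is not part of this notebook. As an illustration only, here is a minimal sketch of what the prototype could look like, assuming Flask and the kNN_model.sav file saved above; the endpoint name and the JSON input format are my own placeholders, and input scaling is omitted for brevity:
# app.py - minimal prediction endpoint (sketch, not the final prototype)
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open('kNN_model.sav', 'rb'))

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()  # expects e.g. {"page_popularity": 1000, "wday_nr": 3}
    features = [[data['page_popularity'], data['wday_nr']]]
    prediction = model.predict(features)[0]
    return jsonify({'predicted_share_count': float(prediction)})

if __name__ == '__main__':
    app.run(debug=True)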
Go back to Table of contents.
This section includes all feedback I received on this project. The idea is to make it transparent and easily accessible.
Feedback Iteration 2 + Addressing it in Iteration 3 (Placeholder)
- Data analytics & Investigative analysis
Go back to Table of contents.
Need of feature scaling in machine learning. (n.d.).
Retrieved April 13, 2022, from https://www.enjoyalgorithms.com/blog/need-of-feature-scaling-in-machine-learning
Dobilas, S. (2022, February 12). Support vector regression (SVR) - one of the most flexible yet robust prediction algorithms. Medium.
Retrieved April 13, 2022, from https://towardsdatascience.com/support-vector-regression-svr-one-of-the-most-flexible-yet-robust-prediction-algorithms-4d25fbdaca60
Sreenivasa, S. (2020, October 12). Radial basis function (RBF) kernel: The go-to kernel. Medium.
Retrieved April 13, 2022, from https://towardsdatascience.com/radial-basis-function-rbf-kernel-the-go-to-kernel-acf0d22c798a
Built-in continuous color scales in Python. Plotly. (n.d.).
Retrieved April 13, 2022, from https://plotly.com/python/builtin-colorscales/
Go back to Table of contents.